Getting started with read QC
…BRIEF INTRO IN PROGRESS…
Snakemake workflow for read QC
This is a tentative Snakemake workflow that defines the read quality control rules as a DAG (directed acyclic graph). A detailed interactive Snakemake report is available here; it displays best on a wide screen.
Some potential QC tools
- SeqKit
- FastQC
- MultiQC
- BBDuk script
- Trimmomatic
- KneadData
Some QC resources
- Adapter fasta files
- PhiX fasta file
Tool dictionary (environment.yml)
name: readqc
channels:
- conda-forge
- bioconda
- biobakery
dependencies:
- seqkit =2.3.1
- fastqc =0.12.1
- multiqc =1.14
- bbmap =39.01
- trimmomatic =0.39
- kneaddata =0.12.0
# Create the readqc environment from the YAML file, then activate it
conda activate base
mamba env create --file environment.yml
conda activate readqc
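Before moving on, it can help to confirm that the tools are actually on the PATH of the activated environment. The following is a minimal sketch and not part of the original workflow: the function name `check_tools` is hypothetical, and the executable names in the usage line are assumptions based on the tool list above.

```shell
#!/bin/bash
# Hypothetical helper: report which of the given executables are missing from PATH.
check_tools() {
    local missing=0
    local tool
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "OK: $tool"
        else
            echo "MISSING: $tool"
            missing=1
        fi
    done
    # Exit status is 0 only when every tool was found.
    return "$missing"
}

# Usage (run inside the activated readqc environment; executable names are assumptions):
# check_tools seqkit fastqc multiqc bbduk.sh trimmomatic kneaddata
```

A non-zero exit status makes the helper easy to use as a guard at the top of a longer script.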
Read simple statistics
Assuming that the seqkit installation was successful, we can use it to get simple summary statistics of the reads. Later we will use the seqkit output to prepare sample mapping files automatically.
- If the files are uncompressed, we can save space by compressing them.
- Let’s navigate to the folder containing the fastq files and compress them using the gzip command.
gzip *.fastq
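The one-liner above reports an error when no uncompressed files match the pattern. A slightly more defensive variant could look like the sketch below. The function name `compress_fastq` is hypothetical, and the integrity check with `gzip -t` is an addition for illustration, not a step taken from the original workflow.

```shell
#!/bin/bash
# Hypothetical helper: gzip every uncompressed .fastq file in a directory,
# then test the integrity of all .fastq.gz files found there.
compress_fastq() {
    local dir="${1:-.}"
    local f
    for f in "$dir"/*.fastq; do
        # When nothing matches, the glob stays literal; skip it.
        [ -e "$f" ] || continue
        gzip "$f"
    done
    for f in "$dir"/*.fastq.gz; do
        [ -e "$f" ] || continue
        # gzip -t exits non-zero if the compressed file is corrupt.
        gzip -t "$f" || { echo "CORRUPT: $f" >&2; return 1; }
    done
}

# Usage (directory taken from the project tree):
# compress_fastq resources/reads
```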
From this point forward, we will assume that all the fastq files are in fastq.gz format.
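#!/bin/bash
# Collect basic statistics for all raw reads with seqkit.
echo "PROGRESS: Getting stats of the raw reads."
INPUTDIR="resources/reads"   # folder containing the fastq.gz files
SEQKIT="results/qc/seqkit1"  # output folder for the seqkit report
mkdir -p "${SEQKIT}"
seqkit stat "${INPUTDIR}"/*.fastq.gz > "${SEQKIT}/seqkit_stats.txt"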
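As an illustration of how the seqkit output might later feed a sample mapping file, here is a minimal sketch for paired-end data. It rests on several assumptions that are not confirmed by the workflow: files are named `<sample>_R1.fastq.gz` and `<sample>_R2.fastq.gz`, file paths contain no spaces, and the column names `sample_id`, `fq1`, and `fq2` are placeholders. The function name `make_pe_samples` is hypothetical.

```shell
#!/bin/bash
# Hypothetical helper: build a paired-end sample sheet from seqkit stats output.
# Assumption: reads are named <sample>_R1.fastq.gz and <sample>_R2.fastq.gz.
# Assumption: the column names sample_id, fq1, fq2 are illustrative only.
make_pe_samples() {
    local stats_file="$1"
    printf 'sample_id\tfq1\tfq2\n'
    # Skip the header line; the first column holds the file path.
    # Only forward reads are used; the reverse read path is derived from them.
    awk 'NR > 1 && $1 ~ /_R1\.fastq\.gz$/ {
        fq1 = $1
        fq2 = fq1
        sub(/_R1\.fastq\.gz$/, "_R2.fastq.gz", fq2)
        n = split(fq1, parts, "/")
        sample = parts[n]
        sub(/_R1\.fastq\.gz$/, "", sample)
        printf "%s\t%s\t%s\n", sample, fq1, fq2
    }' "$stats_file"
}

# Usage (input path from the script above; the output file name is an assumption):
# make_pe_samples results/qc/seqkit1/seqkit_stats.txt > config/pe_samples.tsv
```

Deriving the reverse read path from the forward read keeps each pair on one row, but it does not check that the R2 file actually exists.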
Read Quality Control
- Assuming that most QC tools are ready, it is time to use them to do the following:
  - Check the quality of the reads using fastqc.
  - Create a summary report of quality metrics using multiqc.
  - Trim poor reads at a user-specified cutoff using bbduk.sh.
  - Remove contaminants using bbduk.sh.
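The four steps listed above can be sketched as small shell functions. This is a hedged illustration, not the workflow's actual rules: the function names, the `DRY_RUN` switch, and the cutoffs (`trimq` default of 20, `minlength=50`, the k-mer settings) are assumptions chosen for the example, and the adapter and PhiX fasta paths must be supplied by the user. With `DRY_RUN=1` the commands are printed instead of executed, which makes it easy to inspect them first.

```shell
#!/bin/bash
# Sketch of the four QC steps. With DRY_RUN=1 commands are printed, not executed.
# All function names, paths, and cutoffs here are assumptions for illustration.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Steps 1 and 2: per-file quality reports, then one summary report.
# $1: directory with fastq.gz files, $2: output directory
qc_report() {
    run mkdir -p "$2"
    run fastqc --outdir "$2" "$1"/*.fastq.gz
    run multiqc "$2" --outdir "$2"
}

# Step 3: adapter and quality trimming of a read pair.
# $1/$2: input R1/R2, $3/$4: output R1/R2, $5: adapter fasta, $6: quality cutoff
trim_reads() {
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" \
        ref="$5" ktrim=r k=23 mink=11 hdist=1 tpe tbo \
        qtrim=rl trimq="${6:-20}" minlength=50
}

# Step 4: drop read pairs that match the contaminant reference (for example PhiX).
# $1/$2: input R1/R2, $3/$4: output R1/R2, $5: contaminant fasta
remove_phix() {
    run bbduk.sh in1="$1" in2="$2" out1="$3" out2="$4" \
        ref="$5" k=31 hdist=1
}

# Usage (hypothetical file names):
# DRY_RUN=1 trim_reads S1_R1.fastq.gz S1_R2.fastq.gz trimmed_R1.fastq.gz trimmed_R2.fastq.gz adapters.fa 20
```

Running `qc_report` again on the trimmed and on the decontaminated reads gives comparable reports for each stage.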
QC on raw reads
QC after trimming poor reads
QC after removing contaminated reads
Processed read status
References
[1] Buza, T. M., Tonui, T., Stomeo, F., Tiambo, C., Katani, R., Schilling, M., … Kapur, V. (2019). iMAP: An integrated bioinformatics and visualization pipeline for microbiome data analysis. BMC Bioinformatics, 20. https://doi.org/10.1186/S12859-019-2965-4
Appendix
Project main tree
.
├── LICENSE
├── README.md
├── Rplots.pdf
├── config
│  ├── config.yml
│  ├── pbs
│  ├── pe_samples.tsv
│  ├── pe_units.tsv
│  ├── se_samples.tsv
│  ├── se_units.tsv
│  └── slurm
├── dags
│  ├── rulegraph.png
│  └── rulegraph.svg
├── images
│  ├── funnels.png
│  ├── project_tree.txt
│  ├── qc_hist.png
│  ├── qc_hist.svg
│  ├── samples_hist.png
│  ├── samples_hist.svg
│  └── smkreport
├── imap-read-quality-control.Rproj
├── index.Rmd
├── library
│  ├── apa.csl
│  ├── imap.bib
│  └── references.bib
├── reporrt.html
├── report.html
├── resources
│  ├── metadata
│  └── reads
├── results
│  ├── project_tree.txt
│  └── qc
├── smk.css
├── styles.css
├── test.Rmd
└── workflow
   ├── Snakefile
   ├── envs
   ├── report
   ├── rules
   ├── schemas
   └── scripts
18 directories, 28 files
Troubleshooting and FAQs
- Question
  Answer
- Question
  Answer